The file ‘WeatherEDA_202005_v002.Rmd’ contains exploratory data analysis for historical weather data as contained in METAR archives hosted by Iowa State University.
Data have been dowloaded, processed, cleaned, and integrated for several stations (airports) and years, with .rds files saved in “./RInputFiles/ProcessedMETAR”.
This module will perform initial modeling on the processed weather files. It builds on the previous ‘WeatherModeling_202006_v001.Rmd’ and leverages functions in ‘WeatherModelingFunctions_v001.R’.
There are three main processed files available for further exploration:
metar_postEDA_20200617.rds and metar_postEDA_extra_20200627.rds
metar_modifiedClouds_20200617.rds and metar_modifiedclouds_extra_20200627.rds
metar_precipLists_20200617.rds and metar_precipLists_extra_20200627.rds
Several mapping files are defined for use in plotting; tidyverse, lubridate, and caret are loaded; and the relevant functions are sourced:
# The process frequently uses tidyverse, lubridate, caret, and randomForest
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1 v purrr 0.3.3
## v tibble 2.1.3 v dplyr 0.8.4
## v tidyr 1.0.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ---------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
# The main path for the files
filePath <- "./RInputFiles/ProcessedMETAR/"
# Sourcing functions
source("./WeatherModelingFunctions_v001.R")
# Descriptive names for key variables
varMapper <- c(source="Original source file name",
locale="Descriptive name",
dtime="Date-Time (UTC)",
origMETAR="Original METAR",
year="Year",
monthint="Month",
month="Month",
day="Day of Month",
WindDir="Wind Direction (degrees)",
WindSpeed="Wind Speed (kts)",
WindGust="Wind Gust (kts)",
predomDir="General Prevailing Wind Direction",
Visibility="Visibility (SM)",
Altimeter="Altimeter (inches Hg)",
TempF="Temperature (F)",
DewF="Dew Point (F)",
modSLP="Sea-Level Pressure (hPa)",
cType1="First Cloud Layer Type",
cLevel1="First Cloud Layer Height (ft)",
isRain="Rain at Observation Time",
isSnow="Snow at Observation Time",
isThunder="Thunder at Obsevation Time",
tempFHi="24-hour High Temperature (F)",
tempFLo="24-hour Low Temperature (F)",
minHeight="Minimum Cloud Height (ft)",
minType="Obscuration Level at Minimum Cloud Height",
ceilingHeight="Minimum Ceiling Height (ft)",
ceilingType="Obscuration Level at Minimum Ceiling Height",
hr="Hour of Day (Zulu time)"
)
# File name to city name mapper
cityNameMapper <- c(katl_2016="Atlanta, GA (2016)",
kbos_2016="Boston, MA (2016)",
kdca_2016="Washington, DC (2016)",
kden_2016="Denver, CO (2016)",
kdfw_2016="Dallas, TX (2016)",
kdtw_2016="Detroit, MI (2016)",
kewr_2016="Newark, NJ (2016)",
kgrb_2016="Green Bay, WI (2016)",
kgrr_2016="Grand Rapids, MI (2016)",
kiah_2016="Houston, TX (2016)",
kind_2016="Indianapolis, IN (2016)",
klas_2014="Las Vegas, NV (2014)",
klas_2015="Las Vegas, NV (2015)",
klas_2016="Las Vegas, NV (2016)",
klas_2017="Las Vegas, NV (2017)",
klas_2018="Las Vegas, NV (2018)",
klas_2019="Las Vegas, NV (2019)",
klax_2016="Los Angeles, CA (2016)",
klnk_2016="Lincoln, NE (2016)",
kmia_2016="Miami, FL (2016)",
kmke_2016="Milwaukee, WI (2016)",
kmsn_2016="Madison, WI (2016)",
kmsp_2016="Minneapolis, MN (2016)",
kmsy_2014="New Orleans, LA (2014)",
kmsy_2015="New Orleans, LA (2015)",
kmsy_2016="New Orleans, LA (2016)",
kmsy_2017="New Orleans, LA (2017)",
kmsy_2018="New Orleans, LA (2018)",
kmsy_2019="New Orleans, LA (2019)",
kord_2014="Chicago, IL (2014)",
kord_2015="Chicago, IL (2015)",
kord_2016="Chicago, IL (2016)",
kord_2017="Chicago, IL (2017)",
kord_2018="Chicago, IL (2018)",
kord_2019="Chicago, IL (2019)",
kphl_2016="Philadelphia, PA (2016)",
kphx_2016="Phoenix, AZ (2016)",
ksan_2014="San Diego, CA (2014)",
ksan_2015="San Diego, CA (2015)",
ksan_2016="San Diego, CA (2016)",
ksan_2017="San Diego, CA (2017)",
ksan_2018="San Diego, CA (2018)",
ksan_2019="San Diego, CA (2019)",
ksat_2016="San Antonio, TX (2016)",
ksea_2016="Seattle, WA (2016)",
ksfo_2016="San Francisco, CA (2016)",
ksjc_2016="San Jose, CA (2016)",
kstl_2016="Saint Louis, MO (2016)",
ktpa_2016="Tampa Bay, FL (2016)",
ktvc_2016="Traverse City, MI (2016)"
)
# File names in 2016, based on cityNameMapper
names_2016 <- grep(names(cityNameMapper), pattern="[a-z]{3}_2016", value=TRUE)
The main data will be from the metar_postEDA files. They are integrated below, and cloud and ceiling heights are converted to factors:
# Main weather data
metarData <- readRDS("./RInputFiles/ProcessedMETAR/metar_postEDA_20200617.rds") %>%
bind_rows(readRDS("./RInputFiles/ProcessedMETAR/metar_postEDA_extra_20200627.rds")) %>%
mutate(orig_minHeight=minHeight,
orig_ceilingHeight=ceilingHeight,
minHeight=mapCloudHeight(minHeight),
ceilingHeight=mapCloudHeight(ceilingHeight)
)
glimpse(metarData)
## Observations: 437,417
## Variables: 43
## $ source <chr> "kdtw_2016", "kdtw_2016", "kdtw_2016", "kdtw_201...
## $ locale <chr> "Detroit, MI (2016)", "Detroit, MI (2016)", "Det...
## $ dtime <dttm> 2016-01-01 00:53:00, 2016-01-01 01:53:00, 2016-...
## $ origMETAR <chr> "KDTW 010053Z 23012KT 10SM OVC020 00/M05 A3021 R...
## $ year <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, ...
## $ monthint <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ month <fct> Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan...
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ WindDir <chr> "230", "230", "230", "240", "230", "220", "220",...
## $ WindSpeed <int> 12, 12, 11, 14, 16, 13, 14, 16, 13, 16, 17, 13, ...
## $ WindGust <dbl> NA, NA, NA, 23, 22, NA, 20, 20, NA, 22, NA, NA, ...
## $ predomDir <fct> SW, SW, SW, SW, SW, SW, SW, SW, SW, SW, SW, SW, ...
## $ Visibility <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 8, 5, 7, 8, 10, ...
## $ Altimeter <dbl> 30.21, 30.21, 30.19, 30.19, 30.18, 30.16, 30.14,...
## $ TempF <dbl> 32.00, 32.00, 32.00, 30.92, 30.92, 32.00, 30.92,...
## $ DewF <dbl> 23.00, 21.92, 21.02, 19.94, 19.94, 19.94, 19.94,...
## $ modSLP <dbl> 1023.6, 1023.5, 1023.0, 1023.0, 1022.7, 1022.0, ...
## $ cType1 <chr> "OVC", "OVC", "OVC", "OVC", "OVC", "OVC", "OVC",...
## $ cType2 <chr> "", "", "", "", "", "", "", "", "", "OVC", "OVC"...
## $ cType3 <chr> "", "", "", "", "", "", "", "", "", "", "", "", ...
## $ cType4 <chr> "", "", "", "", "", "", "", "", "", "", "", "", ...
## $ cType5 <chr> "", "", "", "", "", "", "", "", "", "", "", "", ...
## $ cType6 <chr> "", "", "", "", "", "", "", "", "", "", "", "", ...
## $ cLevel1 <dbl> 2000, 2000, 2200, 2500, 2700, 2500, 2500, 2500, ...
## $ cLevel2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 4500, 4000, ...
## $ cLevel3 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ cLevel4 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ cLevel5 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ cLevel6 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ isRain <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ isSnow <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ isThunder <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ p1Inches <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 0, 0, 0, 0, 0, N...
## $ p36Inches <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 0, NA, NA, 0, NA...
## $ p24Inches <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ tempFHi <dbl> NA, NA, NA, NA, 36, NA, NA, NA, NA, NA, NA, NA, ...
## $ tempFLo <dbl> NA, NA, NA, NA, 31, NA, NA, NA, NA, NA, NA, NA, ...
## $ minHeight <fct> Low, Low, Low, Low, Low, Low, Low, Low, Low, Low...
## $ minType <fct> OVC, OVC, OVC, OVC, OVC, OVC, OVC, OVC, OVC, BKN...
## $ ceilingHeight <fct> Low, Low, Low, Low, Low, Low, Low, Low, Low, Low...
## $ ceilingType <fct> OVC, OVC, OVC, OVC, OVC, OVC, OVC, OVC, OVC, BKN...
## $ orig_minHeight <dbl> 2000, 2000, 2200, 2500, 2700, 2500, 2500, 2500, ...
## $ orig_ceilingHeight <dbl> 2000, 2000, 2200, 2500, 2700, 2500, 2500, 2500, ...
An initial exploration of the data can use hierarchical clustering based on several numerical features of the data (temperature, dew point, sea-level pressure, wind speed) by month.
Example code includes:
# Create hierarchical clustering for 2016 data
# Distance is calculated only where month is the same
distData <- metarData %>%
filter(year==2016) %>%
select(locale, month, TempF, DewF, modSLP, WindSpeed) %>%
group_by(locale, month) %>%
summarize_if(is.numeric, mean, na.rm=TRUE) %>%
ungroup() %>%
mutate(rowname=paste0(locale, "_", month)) %>%
column_to_rownames() %>%
select(-locale, -month) %>%
as.matrix() %>%
scale() %>%
dist() %>%
as.matrix() %>%
as.data.frame() %>%
rownames_to_column(var="locale1") %>%
pivot_longer(-locale1, names_to="locale2", values_to="dist") %>%
mutate(city1=str_replace(locale1, "_\\w{3}", ""),
city2=str_replace(locale2, "_\\w{3}", ""),
month1=str_sub(locale1, -3),
month2=str_sub(locale2, -3)) %>%
filter(month1==month2) %>%
group_by(city1, city2) %>%
summarize(meandist=mean(dist), sddist=sd(dist)) %>%
ungroup() %>%
select(-sddist) %>%
pivot_wider(city1, names_from="city2", values_from="meandist") %>%
column_to_rownames(var="city1") %>%
as.matrix() %>%
as.dist(diag=FALSE)
distData %>%
hclust(method="complete") %>%
plot()
distData %>%
hclust(method="single") %>%
plot()
At a glance, there are several sensible findings from the clustering:
There are a handful of cities that do not seem to cluster with anything else:
Suppose that the only information available about two cities were their temperatures and dew points, with month also included. How well would a basic random forest, with mtry=3, classify the cities?
The function is then run for every combination of locales from 2016 in cityNameMapper. A common random seed is applied to every run of the process:
# Create a container list to hold the output
list_2016_TempF_DewF_month <- vector("list", 0.5*length(names_2016)*(length(names_2016)-1))
set.seed(2006281443)
n <- 1
for (ctr in 1:(length(names_2016)-1)) {
for (ctr2 in (ctr+1):length(names_2016)) {
list_2016_TempF_DewF_month[[n]] <- rfTwoLocales(metarData,
loc1=names_2016[ctr],
loc2=names_2016[ctr2],
vrbls=c("TempF", "DewF", "month"),
ntree=25
)
n <- n + 1
if ((n %% 50) == 0) { cat("Through number:", n, "\n")}
}
}
## Through number: 50
## Through number: 100
## Through number: 150
## Through number: 200
## Through number: 250
## Through number: 300
## Through number: 350
## Through number: 400
Accuracy can then be assessed:
# Create a tibble from the underlying accuracy data
acc_TempF_DewF_month <- map_dfr(list_2016_TempF_DewF_month, .f=helperAccuracyLocale)
# Assess the top 10 classification accuracies
acc_TempF_DewF_month %>%
arrange(-accOverall) %>%
head(10)
## # A tibble: 10 x 5
## locale1 locale2 accOverall accLocale1 accLocale2
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Denver, CO (2016) Miami, FL (2016) 0.997 0.997 0.997
## 2 Las Vegas, NV (2016) Miami, FL (2016) 0.990 0.991 0.989
## 3 Miami, FL (2016) Seattle, WA (2016) 0.985 0.985 0.985
## 4 Denver, CO (2016) Tampa Bay, FL (2016) 0.984 0.986 0.982
## 5 Miami, FL (2016) Traverse City, MI (201~ 0.980 0.981 0.980
## 6 Miami, FL (2016) Minneapolis, MN (2016) 0.978 0.978 0.978
## 7 Boston, MA (2016) Miami, FL (2016) 0.976 0.975 0.976
## 8 Denver, CO (2016) New Orleans, LA (2016) 0.975 0.975 0.975
## 9 Miami, FL (2016) Phoenix, AZ (2016) 0.974 0.973 0.975
## 10 Green Bay, WI (2016) Miami, FL (2016) 0.971 0.972 0.971
# Assess the bottom 10 classification accuracies
acc_TempF_DewF_month %>%
arrange(accOverall) %>%
head(10)
## # A tibble: 10 x 5
## locale1 locale2 accOverall accLocale1 accLocale2
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Newark, NJ (2016) Philadelphia, PA (20~ 0.580 0.591 0.570
## 2 Chicago, IL (2016) Milwaukee, WI (2016) 0.591 0.600 0.583
## 3 Chicago, IL (2016) Madison, WI (2016) 0.598 0.622 0.573
## 4 Detroit, MI (2016) Grand Rapids, MI (20~ 0.608 0.568 0.649
## 5 Chicago, IL (2016) Detroit, MI (2016) 0.610 0.633 0.587
## 6 Grand Rapids, MI (201~ Traverse City, MI (2~ 0.613 0.630 0.597
## 7 Madison, WI (2016) Milwaukee, WI (2016) 0.618 0.625 0.611
## 8 Green Bay, WI (2016) Madison, WI (2016) 0.619 0.586 0.652
## 9 Detroit, MI (2016) Madison, WI (2016) 0.619 0.639 0.598
## 10 Chicago, IL (2016) Grand Rapids, MI (20~ 0.621 0.647 0.595
allAccuracy_month <- select(acc_TempF_DewF_month,
locale=locale1,
other=locale2,
accOverall,
accLocale=accLocale1
) %>%
bind_rows(select(acc_TempF_DewF_month,
locale=locale2,
other=locale1,
accOverall,
accLocale=accLocale2
)
)
# Overall accuracy by location plot
allAccuracy_month %>%
group_by(locale) %>%
summarize_if(is.numeric, mean) %>%
ggplot(aes(x=fct_reorder(locale, accOverall), y=accOverall)) +
geom_point(size=2) +
geom_text(aes(y=accOverall+0.02, label=paste0(round(100*accOverall), "%"))) +
coord_flip() +
labs(x="",
y="Average Accuracy",
title="Average Accuracy Predicting Locale",
subtitle="Predictions made 1:1 to each other locale (average accuracy reported)",
caption="Temperature, Dew Point, Month as predictors\n(50% is baseline coinflip accuracy)"
) +
ylim(c(0.5, 1))
# Overall accuracy heatmap (FIX for better ordering of locales)
# allAccuracy_month %>%
# ggplot(aes(x=locale, y=other)) +
# geom_tile(aes(fill=accOverall)) +
# theme(axis.text.x=element_text(angle=90)) +
# scale_fill_continuous("Accuracy", high="darkblue", low="white") +
# labs(title="Accuracy Predicting Locale vs. Locale",
# caption="Temperature, Dew Point, and Month as predictors\n(50% is baseline coinflip accuracy)",
# x="",
# y=""
# )
Accuracy is best for cities with significantly different locales. The model is especially successful at pulling apart the desert cities (Las Vegas, Phoenix) from everything else, and the humid Florida cities (Miami, Tampa Bay) from everything else.
Next, the simple model is run to classify locale across the full 2016 dataset:
# Run random forest for all 2016 locales
rf_all_2016_TDm <- rfMultiLocale(metarData,
vrbls=c("TempF", "DewF", "month"),
locs=names_2016,
ntree=50,
seed=2006281508
)
Summaries can then be created for the accuracy in predicting each locale:
evalPredictions(rf_all_2016_TDm, plotCaption = "Temp, Dew Point, Month, Cloud/Ceiling Height")
## # A tibble: 898 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Atlanta, GA (2016) Atlanta, GA (2016) TRUE 484 0.182
## 2 Atlanta, GA (2016) Boston, MA (2016) FALSE 49 0.0185
## 3 Atlanta, GA (2016) Chicago, IL (2016) FALSE 39 0.0147
## 4 Atlanta, GA (2016) Dallas, TX (2016) FALSE 179 0.0675
## 5 Atlanta, GA (2016) Denver, CO (2016) FALSE 34 0.0128
## 6 Atlanta, GA (2016) Detroit, MI (2016) FALSE 42 0.0158
## 7 Atlanta, GA (2016) Grand Rapids, MI (2016) FALSE 48 0.0181
## 8 Atlanta, GA (2016) Green Bay, WI (2016) FALSE 42 0.0158
## 9 Atlanta, GA (2016) Houston, TX (2016) FALSE 96 0.0362
## 10 Atlanta, GA (2016) Indianapolis, IN (2016) FALSE 49 0.0185
## # ... with 888 more rows
Accuracy is meaningfully increased compared to the null accuracy, though there is still under 50% accuracy in classifying any locale. This is suggestive that the goal will be to define some archetype climates, and to use those for predicting (e.g., humid, cold, desert, etc.).
Clouds (minimum levels and ceiling heights) can potentially help further differentiate the cold weather cities on the downwind side of the lake, as well as helping to further pull apart various marine and mid-continental climates.
An updated random forest model is then run:
# Run random forest for all 2016 locales
rf_all_2016_TDmc <- rfMultiLocale(metarData,
vrbls=c("TempF", "DewF", "month", "minHeight", "ceilingHeight"),
locs=names_2016,
ntree=50,
seed=2006281509
)
The evaluation process is converted to a function:
evalPredictions(rf_all_2016_TDmc, plotCaption = "Temp, Dew Point, Month, Cloud/Ceiling Height")
## # A tibble: 897 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Atlanta, GA (2016) Atlanta, GA (2016) TRUE 643 0.240
## 2 Atlanta, GA (2016) Boston, MA (2016) FALSE 84 0.0313
## 3 Atlanta, GA (2016) Chicago, IL (2016) FALSE 61 0.0227
## 4 Atlanta, GA (2016) Dallas, TX (2016) FALSE 145 0.0541
## 5 Atlanta, GA (2016) Denver, CO (2016) FALSE 44 0.0164
## 6 Atlanta, GA (2016) Detroit, MI (2016) FALSE 51 0.0190
## 7 Atlanta, GA (2016) Grand Rapids, MI (2016) FALSE 49 0.0183
## 8 Atlanta, GA (2016) Green Bay, WI (2016) FALSE 26 0.00969
## 9 Atlanta, GA (2016) Houston, TX (2016) FALSE 82 0.0306
## 10 Atlanta, GA (2016) Indianapolis, IN (2016) FALSE 100 0.0373
## # ... with 887 more rows
# Get the comparison to previous accuracy
deltaAccuracy(prevObject=rf_all_2016_TDm, currObject=rf_all_2016_TDmc)
Accuracy increases meaningfully, though still well under 50% in most cases.
The model can also be built out to consider wind speed and wind direction. No attempt yet is made to control for over-fitting:
# Run random forest for all 2016 locales
rf_all_2016_TDmcw <- rfMultiLocale(metarData,
vrbls=c("TempF", "DewF",
"month",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir"
),
locs=names_2016,
ntree=50,
seed=2006281934
)
The evaluation process can again be run:
# Get the graphs of the prediction accuracy
evalPredictions(rf_all_2016_TDmcw,
plotCaption = "Temp, Dew Point, Month, Cloud/Ceiling Height, Wind Speed/Direction"
)
## # A tibble: 889 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Atlanta, GA (2016) Atlanta, GA (2016) TRUE 1096 0.415
## 2 Atlanta, GA (2016) Boston, MA (2016) FALSE 34 0.0129
## 3 Atlanta, GA (2016) Chicago, IL (2016) FALSE 46 0.0174
## 4 Atlanta, GA (2016) Dallas, TX (2016) FALSE 135 0.0511
## 5 Atlanta, GA (2016) Denver, CO (2016) FALSE 28 0.0106
## 6 Atlanta, GA (2016) Detroit, MI (2016) FALSE 31 0.0117
## 7 Atlanta, GA (2016) Grand Rapids, MI (2016) FALSE 26 0.00984
## 8 Atlanta, GA (2016) Green Bay, WI (2016) FALSE 24 0.00908
## 9 Atlanta, GA (2016) Houston, TX (2016) FALSE 58 0.0219
## 10 Atlanta, GA (2016) Indianapolis, IN (2016) FALSE 59 0.0223
## # ... with 879 more rows
# Get the comparison to previous accuracy
deltaAccuracy(prevObject=rf_all_2016_TDmc, currObject=rf_all_2016_TDmcw)
Including wind significantly improves model accuracy for many locales. Even the cold weather cities are now being predicted with around 30%-40% accucacy.
The model can also be built out to consider hour of day and sea-level pressure. No attempt yet is made to control for over-fitting, with the exception that mtry is held at 4:
# Run random forest for all 2016 locales
rf_all_2016_TDmcwa <- rfMultiLocale(metarData,
vrbls=c("TempF", "DewF",
"month",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP"
),
locs=names_2016,
ntree=50,
mtry=4,
seed=2006282103
)
The evaluation process can again be run:
# Get the graphs of the prediction accuracy
evalPredictions(rf_all_2016_TDmcwa,
plotCaption = "Temp, Dew Point, Month, Cloud/Ceiling Height, Wind Speed/Direction, SLP"
)
## # A tibble: 880 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Atlanta, GA (2016) Atlanta, GA (2016) TRUE 1639 0.625
## 2 Atlanta, GA (2016) Boston, MA (2016) FALSE 16 0.00610
## 3 Atlanta, GA (2016) Chicago, IL (2016) FALSE 23 0.00877
## 4 Atlanta, GA (2016) Dallas, TX (2016) FALSE 75 0.0286
## 5 Atlanta, GA (2016) Denver, CO (2016) FALSE 8 0.00305
## 6 Atlanta, GA (2016) Detroit, MI (2016) FALSE 18 0.00686
## 7 Atlanta, GA (2016) Grand Rapids, MI (2016) FALSE 12 0.00457
## 8 Atlanta, GA (2016) Green Bay, WI (2016) FALSE 6 0.00229
## 9 Atlanta, GA (2016) Houston, TX (2016) FALSE 45 0.0172
## 10 Atlanta, GA (2016) Indianapolis, IN (2016) FALSE 51 0.0194
## # ... with 870 more rows
# Get the comparison to previous accuracy
deltaAccuracy(prevObject=rf_all_2016_TDmcw, currObject=rf_all_2016_TDmcwa)
Adding sea-level pressure again significantly improves prediction ability. The cold weather cities are in the roughly 40%-60% accuracy range, and the other cities are in the roughly 60%-80% accuracy range.
Based on the hierarchical clustering and results of the initial random forest models, a few archetype cities are defined in an attempt to pull out distinctions:
A data subset is created, with “hr” added for the Zulu hour of the observation (as integer rather than as factor):
archeCities <- c("kmia_2016", "klas_2016", "klax_2016", "kden_2016",
"ksea_2016", "kdfw_2016", "kmsp_2016", "katl_2016"
)
archeData <- metarData %>%
filter(source %in% archeCities) %>%
mutate(hr=lubridate::hour(dtime))
A random forest, with mtry=4, is then run using all variables from previous, as well as hour:
# Run random forest for all 2016 locales
rf_arche_2016_TDmcwa <- rfMultiLocale(archeData,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP"
),
locs=NULL, # defaults to running for all locales
ntree=100,
mtry=4,
seed=2006291353
)
##
## Running for locations:
## [1] "katl_2016" "kden_2016" "kdfw_2016" "klas_2016" "klax_2016" "kmia_2016"
## [7] "kmsp_2016" "ksea_2016"
The evaluation process can again be run:
# Get the graphs of the prediction accuracy
evalPredictions(rf_arche_2016_TDmcwa,
plotCaption = "Temp, Dew Point, Month/Hour, Cloud/Ceiling Height, Wind Speed/Direction, SLP"
)
## # A tibble: 62 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Atlanta, GA (2016) Atlanta, GA (2016) TRUE 2204 0.835
## 2 Atlanta, GA (2016) Dallas, TX (2016) FALSE 139 0.0527
## 3 Atlanta, GA (2016) Denver, CO (2016) FALSE 39 0.0148
## 4 Atlanta, GA (2016) Las Vegas, NV (2016) FALSE 33 0.0125
## 5 Atlanta, GA (2016) Los Angeles, CA (2016) FALSE 86 0.0326
## 6 Atlanta, GA (2016) Miami, FL (2016) FALSE 50 0.0189
## 7 Atlanta, GA (2016) Minneapolis, MN (2016) FALSE 45 0.0170
## 8 Atlanta, GA (2016) Seattle, WA (2016) FALSE 44 0.0167
## 9 Dallas, TX (2016) Atlanta, GA (2016) FALSE 203 0.0789
## 10 Dallas, TX (2016) Dallas, TX (2016) TRUE 2105 0.818
## # ... with 52 more rows
The model is reasonably successful in pulling apart the archetypes. Denver and Dallas seem potentially less distinct, though all archetypes cities are predicted at 80%+ accuracy.
The remaining cities can be classified against the archetype data, to explore any patterns that emerge:
localeByArchetype(rf_arche_2016_TDmcwa,
fullData=metarData,
archeCities=archeCities,
sortDescMatch=TRUE
)
There appear to be several issues:
The first issue is addressed by collapsing Atlanta, Dallas, and Miami and instead using Houston as the archetype; and replacing Minneapolis with Chicago:
archeCities_v02 <- c("kiah_2016", "klas_2016", "klax_2016",
"kden_2016", "ksea_2016", "kord_2016"
)
archeData_v02 <- metarData %>%
filter(source %in% archeCities_v02) %>%
mutate(hr=lubridate::hour(dtime))
A random forest, with mtry=4, is then run using all variables from previous, as well as hour:
# Run random forest for all 2016 locales in archeData_v02
rf_arche_2016_TDmcwa_v02 <- rfMultiLocale(archeData_v02,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP"
),
locs=NULL, # defaults to running for all locales
ntree=100,
mtry=4,
seed=2006291448
)
##
## Running for locations:
## [1] "kden_2016" "kiah_2016" "klas_2016" "klax_2016" "kord_2016" "ksea_2016"
The evaluation process can again be run:
# Get the graphs of the prediction accuracy
evalPredictions(rf_arche_2016_TDmcwa_v02,
plotCaption = "Temp, Dew Point, Month/Hour, Cloud/Ceiling Height, Wind Speed/Direction, SLP"
)
## # A tibble: 36 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Chicago, IL (2016) Chicago, IL (2016) TRUE 2321 0.872
## 2 Chicago, IL (2016) Denver, CO (2016) FALSE 99 0.0372
## 3 Chicago, IL (2016) Houston, TX (2016) FALSE 54 0.0203
## 4 Chicago, IL (2016) Las Vegas, NV (2016) FALSE 13 0.00489
## 5 Chicago, IL (2016) Los Angeles, CA (2016) FALSE 70 0.0263
## 6 Chicago, IL (2016) Seattle, WA (2016) FALSE 104 0.0391
## 7 Denver, CO (2016) Chicago, IL (2016) FALSE 119 0.0453
## 8 Denver, CO (2016) Denver, CO (2016) TRUE 2282 0.869
## 9 Denver, CO (2016) Houston, TX (2016) FALSE 3 0.00114
## 10 Denver, CO (2016) Las Vegas, NV (2016) FALSE 126 0.0480
## # ... with 26 more rows
The model is reasonably successful in pulling apart the archetypes. Houston appears the most distinct (as had Miami previously), and there is less overlap going to/from Denver or Seattle.
The remaining cities can be classified against the archetype data, to explore any patterns that emerge:
localeByArchetype(rf_arche_2016_TDmcwa_v02,
fullData=metarData,
archeCities=archeCities_v02,
sortDescMatch = TRUE
)
At a glance, the archetypes are more sensible. There is a bit of Denver and a bit of Seattle in most of the cold weather cities, and a bit of Houston in some of the warmer cold weather cities. This is suggestive that there are either more then one type of cold-weather cities, or that cold-weather cities tend to be like other archetypes are certain times.
The model is run again, deleting Denver and Seattle, and converting Los Angeles to San Diego:
archeCities_v03 <- c("kiah_2016", "klas_2016", "ksan_2016", "kord_2016")
archeData_v03 <- metarData %>%
filter(source %in% archeCities_v03) %>%
mutate(hr=lubridate::hour(dtime))
A random forest, with mtry=4, is then run using all variables from previous, as well as hour:
# Run random forest for all 2016 locales in archeData_v03
rf_arche_2016_TDmcwa_v03 <- rfMultiLocale(archeData_v03,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP"
),
locs=NULL, # defaults to running for all locales
ntree=100,
mtry=4,
seed=2006291459
)
##
## Running for locations:
## [1] "kiah_2016" "klas_2016" "kord_2016" "ksan_2016"
The evaluation process can again be run:
# Get the graphs of the prediction accuracy
evalPredictions(rf_arche_2016_TDmcwa_v03,
plotCaption = "Temp, Dew Point, Month/Hour, Cloud/Ceiling Height, Wind Speed/Direction, SLP"
)
## Warning: Removed 1 rows containing missing values (geom_text).
## # A tibble: 16 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Chicago, IL (2016) Chicago, IL (2016) TRUE 2511 0.946
## 2 Chicago, IL (2016) Houston, TX (2016) FALSE 69 0.0260
## 3 Chicago, IL (2016) Las Vegas, NV (2016) FALSE 26 0.00979
## 4 Chicago, IL (2016) San Diego, CA (2016) FALSE 49 0.0185
## 5 Houston, TX (2016) Chicago, IL (2016) FALSE 44 0.0170
## 6 Houston, TX (2016) Houston, TX (2016) TRUE 2473 0.954
## 7 Houston, TX (2016) Las Vegas, NV (2016) FALSE 19 0.00733
## 8 Houston, TX (2016) San Diego, CA (2016) FALSE 57 0.0220
## 9 Las Vegas, NV (2016) Chicago, IL (2016) FALSE 29 0.0113
## 10 Las Vegas, NV (2016) Houston, TX (2016) FALSE 13 0.00508
## 11 Las Vegas, NV (2016) Las Vegas, NV (2016) TRUE 2482 0.971
## 12 Las Vegas, NV (2016) San Diego, CA (2016) FALSE 33 0.0129
## 13 San Diego, CA (2016) Chicago, IL (2016) FALSE 35 0.0130
## 14 San Diego, CA (2016) Houston, TX (2016) FALSE 29 0.0108
## 15 San Diego, CA (2016) Las Vegas, NV (2016) FALSE 63 0.0235
## 16 San Diego, CA (2016) San Diego, CA (2016) TRUE 2559 0.953
These archetypes are distinct, with accuracies all in the 95% range. The remaining cities can be classified against the archetype data, to explore any patterns that emerge:
localeByArchetype(rf_arche_2016_TDmcwa_v03,
fullData=metarData,
archeCities=archeCities_v03,
sortDescMatch=TRUE
)
Quite a few cities show as blends of the archetype cities, which is likely correct if archetypes are considered to be Cold, Humid, Marine, Desert. An open question is whether to expand that list with another 1-2 archetypes that attempt to better capture Atlanta, Dallas, Denver, St Louis, and DC.
The Marine archetype may need to be converted to Pacific and to better identify the California entities.
A decision about Seattle is needed, as it is a little like everything and everything is a little like it.
To address the issue of the model learning from specific cities rather than archetypes, user-defined clusters will be created, and data from these clusters will be used in modeling:
Further, a variable will be created for “hr”, the Zulu hour of the observation:
# Create the locale mapper
locMapperTibble <- tibble::tribble(
~source, ~locType,
'katl_2016', 'South',
'kbos_2016', 'Exclude',
'kdca_2016', 'Exclude',
'kden_2016', 'Exclude',
'kdfw_2016', 'South',
'kdtw_2016', 'Cold',
'kewr_2016', 'Exclude',
'kgrb_2016', 'Cold',
'kgrr_2016', 'Cold',
'kiah_2016', 'Humid',
'kind_2016', 'Mixed',
'klas_2016', 'Desert',
'klax_2016', 'Marine',
'klnk_2016', 'Mixed',
'kmia_2016', 'Humid',
'kmke_2016', 'Cold',
'kmsn_2016', 'Cold',
'kmsp_2016', 'Cold',
'kmsy_2016', 'Humid',
'kord_2016', 'Cold',
'kphl_2016', 'Exclude',
'kphx_2016', 'Desert',
'ksan_2016', 'Marine',
'ksat_2016', 'South',
'ksea_2016', 'Exclude',
'ksfo_2016', 'Exclude',
'ksjc_2016', 'Exclude',
'kstl_2016', 'Mixed',
'ktpa_2016', 'Humid',
'ktvc_2016', 'Cold'
)
# Create locMapper
locMapper <- locMapperTibble %>% pull(locType)
names(locMapper) <- locMapperTibble %>% pull(source)
locMapper
## katl_2016 kbos_2016 kdca_2016 kden_2016 kdfw_2016 kdtw_2016 kewr_2016 kgrb_2016
## "South" "Exclude" "Exclude" "Exclude" "South" "Cold" "Exclude" "Cold"
## kgrr_2016 kiah_2016 kind_2016 klas_2016 klax_2016 klnk_2016 kmia_2016 kmke_2016
## "Cold" "Humid" "Mixed" "Desert" "Marine" "Mixed" "Humid" "Cold"
## kmsn_2016 kmsp_2016 kmsy_2016 kord_2016 kphl_2016 kphx_2016 ksan_2016 ksat_2016
## "Cold" "Cold" "Humid" "Cold" "Exclude" "Desert" "Marine" "South"
## ksea_2016 ksfo_2016 ksjc_2016 kstl_2016 ktpa_2016 ktvc_2016
## "Exclude" "Exclude" "Exclude" "Mixed" "Humid" "Cold"
# Create the data file with locType
locData_v04 <- metarData %>%
mutate(locType=locMapper[source], hr=lubridate::hour(dtime)) %>%
filter(year==2016, locType !="Exclude")
# Counts by locType
locData_v04 %>%
count(locType)
## # A tibble: 6 x 2
## locType n
## <chr> <int>
## 1 Cold 70101
## 2 Desert 17543
## 3 Humid 35074
## 4 Marine 17536
## 5 Mixed 26187
## 6 South 26309
There is a risk that the model will predict neutral cities as ‘Cold’ given the 4:1 ratio of Cold cities to Marine/Desert cities. The random forest model is first run on the data ‘as is’:
rf_arche_small_TDmcwa_v04 <- rfMultiLocale(locData_v04,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP"
),
locs=NULL, # defaults to running for all locales
ntree=100,
locVar="locType",
pred="locType",
mtry=4,
otherVar=c("dtime", "source", "locale"),
seed=2006301313
)
##
## Running for locations:
## [1] "Cold" "Desert" "Humid" "Marine" "Mixed" "South"
Prediction accuracy is then assessed:
# Get the graphs of the prediction accuracy
evalPredictions(rf_arche_small_TDmcwa_v04,
plotCaption = "Temp, Dew Point, Month/Hour, Cloud/Ceiling Height, Wind Speed/Direction, SLP",
keyVar="locType"
)
## # A tibble: 36 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Cold Cold TRUE 19819 0.947
## 2 Cold Desert FALSE 37 0.00177
## 3 Cold Humid FALSE 119 0.00568
## 4 Cold Marine FALSE 159 0.00759
## 5 Cold Mixed FALSE 549 0.0262
## 6 Cold South FALSE 254 0.0121
## 7 Desert Cold FALSE 122 0.0234
## 8 Desert Desert TRUE 4790 0.921
## 9 Desert Humid FALSE 41 0.00788
## 10 Desert Marine FALSE 115 0.0221
## # ... with 26 more rows
The designation of Cold, Desert, Humid, and Marine seems successful. The attempt to split South from Humid resulted in good, if incomplete, separation. The attempt to separate a group of ‘warmer’ cold-weather cities from ‘colder’ cold-weather cities was not particularly successful. This could partially be an artifact of ‘Cold’ having double the data volume as ‘Mixed’.
Classifications by city can also be examined:
archeCities_v04 <- locMapperTibble %>%
filter(locType != "Exclude") %>%
pull(source)
localeByArchetype(rf_arche_small_TDmcwa_v04,
fullData=mutate(metarData, locType=locMapper[source]),
archeCities=archeCities_v04,
sortDescMatch=TRUE
)
Many of the cities are nicely classified in to their assigned archetypes, as desired. Among the cities used in the classifications, concerns include:
Classifications are changed somewhat, and data are then filtered so that an equal number of observations from each locale type are applied in the modeling:
# Create the locale mapper
locMapperTibble_v05 <- tibble::tribble(
~source, ~locType,
'katl_2016', 'Exclude',
'kbos_2016', 'Exclude',
'kdca_2016', 'Exclude',
'kden_2016', 'Exclude',
'kdfw_2016', 'Exclude',
'kdtw_2016', 'Cold',
'kewr_2016', 'Exclude',
'kgrb_2016', 'Cold',
'kgrr_2016', 'Cold',
'kiah_2016', 'Humid',
'kind_2016', 'Exclude',
'klas_2016', 'Desert',
'klax_2016', 'Marine',
'klnk_2016', 'Exclude',
'kmia_2016', 'Humid',
'kmke_2016', 'Cold',
'kmsn_2016', 'Cold',
'kmsp_2016', 'Cold',
'kmsy_2016', 'Humid',
'kord_2016', 'Cold',
'kphl_2016', 'Exclude',
'kphx_2016', 'Desert',
'ksan_2016', 'Marine',
'ksat_2016', 'Exclude',
'ksea_2016', 'Exclude',
'ksfo_2016', 'Exclude',
'ksjc_2016', 'Exclude',
'kstl_2016', 'Exclude',
'ktpa_2016', 'Humid',
'ktvc_2016', 'Cold'
)
# Create locMapper
locMapper_v05 <- locMapperTibble_v05 %>% pull(locType)
names(locMapper_v05) <- locMapperTibble_v05 %>% pull(source)
locMapper_v05
## katl_2016 kbos_2016 kdca_2016 kden_2016 kdfw_2016 kdtw_2016 kewr_2016 kgrb_2016
## "Exclude" "Exclude" "Exclude" "Exclude" "Exclude" "Cold" "Exclude" "Cold"
## kgrr_2016 kiah_2016 kind_2016 klas_2016 klax_2016 klnk_2016 kmia_2016 kmke_2016
## "Cold" "Humid" "Exclude" "Desert" "Marine" "Exclude" "Humid" "Cold"
## kmsn_2016 kmsp_2016 kmsy_2016 kord_2016 kphl_2016 kphx_2016 ksan_2016 ksat_2016
## "Cold" "Cold" "Humid" "Cold" "Exclude" "Desert" "Marine" "Exclude"
## ksea_2016 ksfo_2016 ksjc_2016 kstl_2016 ktpa_2016 ktvc_2016
## "Exclude" "Exclude" "Exclude" "Exclude" "Humid" "Cold"
# Create the data file with locType
locData_v05 <- metarData %>%
mutate(locType=locMapper_v05[source], hr=lubridate::hour(dtime)) %>%
filter(year==2016, locType !="Exclude")
# Counts by locType
locData_v05 %>%
count(locType)
## # A tibble: 4 x 2
## locType n
## <chr> <int>
## 1 Cold 70101
## 2 Desert 17543
## 3 Humid 35074
## 4 Marine 17536
The data are the filtered so that there are an equal number of observations from each locale type. The model can later be predicted against the remaining items:
# Set a seed for reporducibility
set.seed(2006301356)
# Find the smallest locale type
nSmall <- locData_v05 %>%
filter(!is.na(locType)) %>%
count(locType) %>%
pull(n) %>%
min()
# Create the relevant data subset
subData_v05 <- locData_v05 %>%
filter(!is.na(locType)) %>%
group_by(locType) %>%
sample_n(size=nSmall, replace=FALSE) %>%
ungroup()
# Sumarize the data subset
subData_v05 %>%
count(locType, locale) %>%
arrange(-n)
## # A tibble: 16 x 3
## locType locale n
## <chr> <chr> <int>
## 1 Marine Los Angeles, CA (2016) 8774
## 2 Desert Phoenix, AZ (2016) 8770
## 3 Desert Las Vegas, NV (2016) 8766
## 4 Marine San Diego, CA (2016) 8762
## 5 Humid New Orleans, LA (2016) 4402
## 6 Humid Miami, FL (2016) 4388
## 7 Humid Houston, TX (2016) 4386
## 8 Humid Tampa Bay, FL (2016) 4360
## 9 Cold Grand Rapids, MI (2016) 2234
## 10 Cold Milwaukee, WI (2016) 2231
## 11 Cold Minneapolis, MN (2016) 2205
## 12 Cold Madison, WI (2016) 2188
## 13 Cold Traverse City, MI (2016) 2182
## 14 Cold Detroit, MI (2016) 2173
## 15 Cold Chicago, IL (2016) 2168
## 16 Cold Green Bay, WI (2016) 2155
The previous model can then be run on the data subset:
rf_arche_small_TDmcwa_v05 <- rfMultiLocale(subData_v05,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP"
),
locs=NULL, # defaults to running for all locales
ntree=100,
locVar="locType",
pred="locType",
mtry=4,
otherVar=c("dtime", "source", "locale"),
seed=2006301358
)
##
## Running for locations:
## [1] "Cold" "Desert" "Humid" "Marine"
The evaluation process can again be run:
# Get the graphs of the prediction accuracy
evalPredictions(rf_arche_small_TDmcwa_v05,
plotCaption = "Temp, Dew Point, Month/Hour, Cloud/Ceiling Height, Wind Speed/Direction, SLP",
keyVar="locType"
)
## # A tibble: 16 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Cold Cold TRUE 4864 0.931
## 2 Cold Desert FALSE 68 0.0130
## 3 Cold Humid FALSE 116 0.0222
## 4 Cold Marine FALSE 175 0.0335
## 5 Desert Cold FALSE 58 0.0112
## 6 Desert Desert TRUE 4925 0.955
## 7 Desert Humid FALSE 34 0.00659
## 8 Desert Marine FALSE 141 0.0273
## 9 Humid Cold FALSE 99 0.0188
## 10 Humid Desert FALSE 64 0.0121
## 11 Humid Humid TRUE 4916 0.932
## 12 Humid Marine FALSE 194 0.0368
## 13 Marine Cold FALSE 88 0.0166
## 14 Marine Desert FALSE 210 0.0396
## 15 Marine Humid FALSE 81 0.0153
## 16 Marine Marine TRUE 4920 0.928
The archetypes are well-separated, with roughly 95% accuracy in every case. Predictions can then also be made for all cities, including those not in the modeling:
archeCities_v05 <- locMapperTibble_v05 %>%
filter(locType != "Exclude") %>%
pull(source)
localeByArchetype(rf_arche_small_TDmcwa_v05,
fullData=mutate(metarData, locType=locMapper_v05[source]),
archeCities=archeCities_v05,
sortDescMatch=TRUE
)
Broadly speaking, cities used in modeling appear to classify in to the appropriate archetypes. For the cities not included in the modeling:
The model can be applied to data from years other than 2016, to see how the classifications are impacted by use in out-of-sample years:
# Predictions on non-2016 data
helperPredictPlot(rf_arche_small_TDmcwa_v05$rfModel,
df=filter(mutate(metarData, hr=lubridate::hour(dtime)), year!=2016),
predOrder=c("Marine", "Humid", "Desert", "Cold"),
locMapper=locMapper_v05
)
## # A tibble: 80 x 4
## locale predicted n pct
## <chr> <fct> <int> <dbl>
## 1 Chicago, IL (2014) Cold 7928 0.915
## 2 Chicago, IL (2014) Desert 273 0.0315
## 3 Chicago, IL (2014) Humid 144 0.0166
## 4 Chicago, IL (2014) Marine 318 0.0367
## 5 Chicago, IL (2015) Cold 7722 0.886
## 6 Chicago, IL (2015) Desert 247 0.0283
## 7 Chicago, IL (2015) Humid 340 0.0390
## 8 Chicago, IL (2015) Marine 411 0.0471
## 9 Chicago, IL (2017) Cold 7660 0.878
## 10 Chicago, IL (2017) Desert 466 0.0534
## # ... with 70 more rows
Model performance on non-2016 data is not as strong, with roughly a 10%-20% loss of accuracy. Predictions are still much better than null accuracy, and the model (mostly) continues to separate the archetypes.
Suppose that models are run on all 2015-2018 data for Chicago, Las Vegas, New Orleans, and San Diego:
# Create the subset for Chicago, Las Vegas, New Orleans, San Diego
sub_2015_2018_data <- metarData %>%
filter(str_sub(source, 1, 4) %in% c("kord", "klas", "kmsy", "ksan"),
year %in% c(2015, 2016, 2017, 2018)
) %>%
mutate(city=str_replace(locale, pattern=" .\\d{4}.", replacement=""),
hr=lubridate::hour(dtime)
)
# Check that proper locales are included
sub_2015_2018_data %>%
count(city, locale)
## # A tibble: 16 x 3
## city locale n
## <chr> <chr> <int>
## 1 Chicago, IL Chicago, IL (2015) 8728
## 2 Chicago, IL Chicago, IL (2016) 8767
## 3 Chicago, IL Chicago, IL (2017) 8740
## 4 Chicago, IL Chicago, IL (2018) 8737
## 5 Las Vegas, NV Las Vegas, NV (2015) 8727
## 6 Las Vegas, NV Las Vegas, NV (2016) 8770
## 7 Las Vegas, NV Las Vegas, NV (2017) 8664
## 8 Las Vegas, NV Las Vegas, NV (2018) 8746
## 9 New Orleans, LA New Orleans, LA (2015) 8727
## 10 New Orleans, LA New Orleans, LA (2016) 8767
## 11 New Orleans, LA New Orleans, LA (2017) 8723
## 12 New Orleans, LA New Orleans, LA (2018) 8737
## 13 San Diego, CA San Diego, CA (2015) 8737
## 14 San Diego, CA San Diego, CA (2016) 8762
## 15 San Diego, CA San Diego, CA (2017) 8744
## 16 San Diego, CA San Diego, CA (2018) 8744
The random forest model is run and cached:
# Run random forest for 2015-2018 data
rf_types_2015_2018_TDmcwha <- rfMultiLocale(sub_2015_2018_data,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP"
),
locs=NULL,
locVar="city",
pred="city",
ntree=50,
seed=2006301420,
mtry=4
)
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
evalPredictions(rf_types_2015_2018_TDmcwha,
plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind, SLP",
keyVar="city"
)
## Warning: Removed 1 rows containing missing values (geom_text).
## # A tibble: 16 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Chicago, IL Chicago, IL TRUE 9847 0.941
## 2 Chicago, IL Las Vegas, NV FALSE 121 0.0116
## 3 Chicago, IL New Orleans, LA FALSE 236 0.0226
## 4 Chicago, IL San Diego, CA FALSE 258 0.0247
## 5 Las Vegas, NV Chicago, IL FALSE 153 0.0146
## 6 Las Vegas, NV Las Vegas, NV TRUE 10048 0.961
## 7 Las Vegas, NV New Orleans, LA FALSE 60 0.00574
## 8 Las Vegas, NV San Diego, CA FALSE 192 0.0184
## 9 New Orleans, LA Chicago, IL FALSE 224 0.0213
## 10 New Orleans, LA Las Vegas, NV FALSE 98 0.00932
## 11 New Orleans, LA New Orleans, LA TRUE 9859 0.938
## 12 New Orleans, LA San Diego, CA FALSE 329 0.0313
## 13 San Diego, CA Chicago, IL FALSE 173 0.0166
## 14 San Diego, CA Las Vegas, NV FALSE 281 0.0269
## 15 San Diego, CA New Orleans, LA FALSE 197 0.0189
## 16 San Diego, CA San Diego, CA TRUE 9789 0.938
Even with a small forest (50 trees), the model is almost always separating Las Vegas, Chicago, San Diego, and New Orleans. While the climates are very different in these cities, it is striking that the model has so few misclassifications.
How do other cities map against these classifications?
# Predictions on 2015/2018 data
helperPredictPlot(rf_types_2015_2018_TDmcwha$rfModel,
df=filter(mutate(metarData, hr=lubridate::hour(dtime)),
!(str_sub(source, 1, 4) %in% c("kord", "klas", "kmsy", "ksan"))
),
predOrder=c("Chicago, IL", "San Diego, CA", "New Orleans, LA", "Las Vegas, NV")
)
## # A tibble: 104 x 4
## locale predicted n pct
## <chr> <fct> <int> <dbl>
## 1 Atlanta, GA (2016) Chicago, IL 3363 0.384
## 2 Atlanta, GA (2016) Las Vegas, NV 809 0.0924
## 3 Atlanta, GA (2016) New Orleans, LA 3709 0.424
## 4 Atlanta, GA (2016) San Diego, CA 870 0.0994
## 5 Boston, MA (2016) Chicago, IL 7710 0.891
## 6 Boston, MA (2016) Las Vegas, NV 430 0.0497
## 7 Boston, MA (2016) New Orleans, LA 291 0.0336
## 8 Boston, MA (2016) San Diego, CA 223 0.0258
## 9 Dallas, TX (2016) Chicago, IL 2999 0.343
## 10 Dallas, TX (2016) Las Vegas, NV 1147 0.131
## # ... with 94 more rows
Classifications are broadly as expected based on the previous archetype analysis. Variable importances are plotted:
helperPlotVarImp(rf_types_2015_2018_TDmcwha$rfModel)
Dew point and temperature by month continue to be strong factors for separating the four cities in this analysis. SLP, minimum cloud height, and prevailing wind direction are also meaningful.
An assessment can be run for the 2015-2018 model:
# Run for the full model including SLP
probs_2015_2018_TDmcwha <-
assessPredictionCertainty(rf_types_2015_2018_TDmcwha,
keyVar="city",
plotCaption="Temp, Dew Point, Month/Hour, Clouds, Wind, SLP",
showAcc=TRUE
)
## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(keyVar)` instead of `keyVar` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(pkVars)` instead of `pkVars` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
A similar process can be run for assessing the classification of the other cities against the 2015-2017 data for Chicago, Las Vegas, New Orleans, and San Diego:
useData <- metarData %>%
filter(!(str_sub(source, 1, 4) %in% c("kord", "klas", "kmsy", "ksan"))) %>%
mutate(hr=lubridate::hour(dtime))
# Run for the model excluding SLP
probs_allcities_2015_2018_TDmcwh <-
assessPredictionCertainty(rf_types_2015_2018_TDmcwha,
testData=useData,
keyVar="locale",
plotCaption="Temp, Dew Point, Month/Hour, Clouds, Wind, modSLP",
showHists=TRUE
)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The model is frequently not so confident in assigning an archetype to related cities, though it frequently gets the most sensible assignment.
An assessment is run to look at the evolution of the model as it take in the ‘next best’ variable for building out the random forest:
# Define the variables to be considered
possVars <- c("TempF", "DewF", "month", "hr", "minHeight",
"ceilingHeight", "WindSpeed", "predomDir", "modSLP"
)
# Create a container for storing the best random forest and variable added from each run
rfContainer <- vector("list", length=length(possVars))
rfVarAdds <- vector("character", length=length(possVars))
# Run the functions using a for loop over the length of possVars
for (ctr in 1:length(possVars)) {
# Pull in the results of the previous run
if (ctr==1) {
prevVars <- c()
} else {
prevVars <- rfVarAdds[1:(ctr-1)]
}
# Run each of them through the combinations
tmpList <- helperRFCombinations(possVars, df=sub_2015_2018_data, prevVars=prevVars)
# Assess the performance
tmpAccuracy <- helperVariableAccuracy(tmpList, possVars=possVars, prevVars=prevVars)
# Prepare to repeat the process
bestRow <- tmpAccuracy %>%
filter(locale=="OOB") %>%
filter(accuracy==max(accuracy))
bestRow
# Update the rfContainer and rfVarAdds elements
rfContainer[[ctr]] <- tmpList[[as.integer(pull(bestRow, vrblNum))]]
rfVarAdds[ctr] <- pull(bestRow, vrblName)
cat("\nVariable Added:", rfVarAdds[ctr], "\n")
}
##
## Will run for: TempF DewF month hr minHeight ceilingHeight WindSpeed predomDir modSLP
## Fixed variable(s) will be:
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Variable Added: DewF
##
## Will run for: TempF month hr minHeight ceilingHeight WindSpeed predomDir modSLP
## Fixed variable(s) will be: DewF
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Variable Added: TempF
##
## Will run for: month hr minHeight ceilingHeight WindSpeed predomDir modSLP
## Fixed variable(s) will be: DewF TempF
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Variable Added: month
##
## Will run for: hr minHeight ceilingHeight WindSpeed predomDir modSLP
## Fixed variable(s) will be: DewF TempF month
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Variable Added: modSLP
##
## Will run for: hr minHeight ceilingHeight WindSpeed predomDir
## Fixed variable(s) will be: DewF TempF month modSLP
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Variable Added: predomDir
##
## Will run for: hr minHeight ceilingHeight WindSpeed
## Fixed variable(s) will be: DewF TempF month modSLP predomDir
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Variable Added: minHeight
##
## Will run for: hr ceilingHeight WindSpeed
## Fixed variable(s) will be: DewF TempF month modSLP predomDir minHeight
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Variable Added: hr
##
## Will run for: ceilingHeight WindSpeed
## Fixed variable(s) will be: DewF TempF month modSLP predomDir minHeight hr
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Variable Added: WindSpeed
##
## Will run for: ceilingHeight
## Fixed variable(s) will be: DewF TempF month modSLP predomDir minHeight hr WindSpeed
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Variable Added: ceilingHeight
The evolution of accuracy can then be plotted:
# Pull the accuracy data from the variables selected
tblAccuracy <- map_dfr(rfContainer, .f=helperExtractAccuracy, .id="vrblNum") %>%
mutate(vrblName=rfVarAdds[as.integer(vrblNum)],
locale=factor(locale, levels=c("OOB", unique(locale)[unique(locale) != "OOB"])),
plotLabel=ifelse(as.integer(vrblNum)==1, vrblName, paste0("add ", vrblName))
)
# Plot the gains in accuracy, facetted by 'locale'
ggplot(tblAccuracy, aes(x=fct_reorder(plotLabel, -as.integer(vrblNum)))) +
geom_text(aes(y=accuracy, label=paste0(round(100*accuracy), "%"))) +
facet_wrap(~locale, nrow=1) +
ylim(c(0, 1)) +
geom_hline(yintercept=0.25, lty=2) +
labs(x="",
y="",
title="Evolution of accuracy as next-best variables are added",
caption="Null accuracy is 25%"
) +
coord_flip()
As seen previously, dew point, temperature, and month are significant differentiating variables, driving roughly 75% accuracy in classifying the archetypes.
Adding altimeter (SLP) bumps the accuracy by another ~10%, while adding minimum cloud height and prevailing wind direction bump the accuracy by another ~5%.
This shows some of the power of the random forest algorithm as it is given additional variables to explore. Evolution of error can be plotted to see the impact:
oobError <- c()
for (ctr in 1:9) {
oobError <- c(oobError, rfContainer[[ctr]]$rfModel$err.rate[, "OOB"])
}
tibble::tibble(nVars=rep(1:9, each=25),
ntrees=rep(1:25, times=9),
accuracy=1-oobError
) %>%
ggplot(aes(x=ntrees, y=accuracy)) +
geom_line(aes(group=nVars, color=factor(nVars))) +
labs(x="Number of Trees",
y="OOB Accuracy",
title="Accuracy improves with more trees and more variables"
) +
scale_color_discrete("# Variables")
With a greater number of variables, there is a greater lift in accuracy as the forest grows larger.
Suppose that an attempt is made to classify the year for a single city. Example code includes:
# Create a subset of the data for only Chicago (2014-2019)
sub_kord_data <- metarData %>%
mutate(hr=lubridate::hour(dtime)) %>%
filter(str_sub(source, 1 ,4) %in% "kord")
# Run random forest for Chicago 2014-2019 data
rf_kord_TDmcwha <- rfMultiLocale(sub_kord_data,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP"
),
locs=NULL,
locVar="year",
pred="year",
ntree=200,
seed=2007011309,
mtry=4
)
##
## Running for locations:
## [1] 2014 2015 2016 2017 2018 2019
Prediction accuracy and variable importance can then be assessed:
# Evaluate prediction accuracy
evalPredictions(rf_kord_TDmcwha,
plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind, SLP",
keyVar="year",
locOrder=TRUE
)
## # A tibble: 36 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 2014 2014 TRUE 2016 0.798
## 2 2014 2015 FALSE 136 0.0539
## 3 2014 2016 FALSE 85 0.0337
## 4 2014 2017 FALSE 101 0.04
## 5 2014 2018 FALSE 102 0.0404
## 6 2014 2019 FALSE 85 0.0337
## 7 2015 2014 FALSE 126 0.048
## 8 2015 2015 TRUE 2033 0.774
## 9 2015 2016 FALSE 145 0.0552
## 10 2015 2017 FALSE 103 0.0392
## # ... with 26 more rows
# Evaluate variable importance
helperPlotVarImp(rf_kord_TDmcwha$rfModel)
# Evaluate error evolution
errorEvolution(rf_kord_TDmcwha, useCategory="OOB", subT="Chicago (2014-2019)")
## # A tibble: 200 x 3
## ntree Category Error
## <int> <chr> <dbl>
## 1 1 OOB 0.491
## 2 2 OOB 0.488
## 3 3 OOB 0.489
## 4 4 OOB 0.476
## 5 5 OOB 0.469
## 6 6 OOB 0.454
## 7 7 OOB 0.442
## 8 8 OOB 0.428
## 9 9 OOB 0.418
## 10 10 OOB 0.408
## # ... with 190 more rows
Interestingly, the model predicts the year with 75%-80% accuracy (null accuracy 17%), suggesting there is significant annual variation in Chicago climate. Sea-level-pressure has the largest variable importance (SLP is a mix of altitude, temperature, dew point, and high/low pressure systems). The first 50 trees significantly reduce the OOB error, with more modest, but still continuing, error reduction afterwards.
The process is run for Las Vegas:
# Create a subset of the data for only Las Vegas (2014-2019)
sub_klas_data <- metarData %>%
mutate(hr=lubridate::hour(dtime)) %>%
filter(str_sub(source, 1 ,4) %in% "klas")
# Run random forest for Las Vegas 2014-2019 data
rf_klas_TDmcwha <- rfMultiLocale(sub_klas_data,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP"
),
locs=NULL,
locVar="year",
pred="year",
ntree=200,
seed=2007011405,
mtry=4
)
##
## Running for locations:
## [1] 2014 2015 2016 2017 2018 2019
Prediction accuracy and variable importance can then be assessed:
# Evaluate prediction accuracy
evalPredictions(rf_klas_TDmcwha,
plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind, SLP",
keyVar="year",
locOrder=TRUE
)
## # A tibble: 36 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 2014 2014 TRUE 1885 0.730
## 2 2014 2015 FALSE 175 0.0678
## 3 2014 2016 FALSE 142 0.0550
## 4 2014 2017 FALSE 149 0.0577
## 5 2014 2018 FALSE 109 0.0422
## 6 2014 2019 FALSE 121 0.0469
## 7 2015 2014 FALSE 197 0.0757
## 8 2015 2015 TRUE 1711 0.658
## 9 2015 2016 FALSE 162 0.0623
## 10 2015 2017 FALSE 169 0.0650
## # ... with 26 more rows
# Evaluate variable importance
helperPlotVarImp(rf_klas_TDmcwha$rfModel)
# Evaluate error evolution
errorEvolution(rf_klas_TDmcwha, useCategory="OOB", subT="Las Vegas (2014-2019)")
## # A tibble: 200 x 3
## ntree Category Error
## <int> <chr> <dbl>
## 1 1 OOB 0.576
## 2 2 OOB 0.569
## 3 3 OOB 0.565
## 4 4 OOB 0.560
## 5 5 OOB 0.554
## 6 6 OOB 0.548
## 7 7 OOB 0.536
## 8 8 OOB 0.527
## 9 9 OOB 0.518
## 10 10 OOB 0.511
## # ... with 190 more rows
Prediction accuracies are lower for Las Vegas, averaging 60%-75% (2014 appears to be much more differentiated than the other years). This is suggestive that there is less annual variation in the Las Vegas climate than there is in the Chicago climate, though still enough to meaningfully pull apart the years.
Modified seal-level pressure is the top predictor, and the majority of the OOB error reduction occurs in the first 50 trees.
Suppose that modSLP is eliminated as a predictor variable, and the process is scaled down to 50 trees and repeated for Chicago:
# Run random forest for Chicago 2014-2019 data
rf_kord_TDmcwh_50 <- rfMultiLocale(sub_kord_data,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir"
),
locs=NULL,
locVar="year",
pred="year",
ntree=50,
seed=2007011410,
mtry=4
)
##
## Running for locations:
## [1] 2014 2015 2016 2017 2018 2019
Prediction accuracy and variable importance can then be assessed:
# Evaluate prediction accuracy
evalPredictions(rf_kord_TDmcwh_50,
plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind",
keyVar="year",
locOrder=TRUE
)
## # A tibble: 36 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 2014 2014 TRUE 1621 0.629
## 2 2014 2015 FALSE 194 0.0753
## 3 2014 2016 FALSE 175 0.0679
## 4 2014 2017 FALSE 217 0.0842
## 5 2014 2018 FALSE 197 0.0764
## 6 2014 2019 FALSE 174 0.0675
## 7 2015 2014 FALSE 240 0.0901
## 8 2015 2015 TRUE 1554 0.583
## 9 2015 2016 FALSE 238 0.0893
## 10 2015 2017 FALSE 196 0.0736
## # ... with 26 more rows
# Evaluate variable importance
helperPlotVarImp(rf_kord_TDmcwh_50$rfModel)
# Compare error evolution through the first 50 trees
tibble::tibble(ntree=1:50,
withSLP=rf_kord_TDmcwha$rfModel$err.rate[1:50, "OOB"],
noSLP=rf_kord_TDmcwh_50$rfModel$err.rate[1:50, "OOB"]
) %>%
pivot_longer(-ntree, names_to="Model", values_to="Error") %>%
ggplot(aes(x=ntree, y=Error, color=Model)) +
geom_line() +
geom_text(data=~filter(., ntree %in% c(1, 50)),
aes(y=Error+0.02, label=paste0(round(100*Error, 1), "%"))
) +
labs(x="# Trees",
y="OOB Error Rate",
title="OOB Error Evolution for SLP Included/Excluded"
) +
ylim(c(0, NA))
Modified sea-level pressure is revealed to be a significant driver of prediction accuracy for Chicago, confirming findings of the previous variable importance analysis. This suggests different patterns of high and low pressure being in control by year may meaningfully improve the ability to predict Chicago by year.
The process can also be run for Las Vegas:
# Run random forest for Las Vegas 2014-2019 data
rf_klas_TDmcwh_50 <- rfMultiLocale(sub_klas_data,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir"
),
locs=NULL,
locVar="year",
pred="year",
ntree=50,
seed=2007011420,
mtry=4
)
##
## Running for locations:
## [1] 2014 2015 2016 2017 2018 2019
Prediction accuracy and variable importance can then be assessed:
# Evaluate prediction accuracy
evalPredictions(rf_klas_TDmcwh_50,
plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind",
keyVar="year",
locOrder=TRUE
)
## # A tibble: 36 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 2014 2014 TRUE 1610 0.630
## 2 2014 2015 FALSE 258 0.101
## 3 2014 2016 FALSE 177 0.0693
## 4 2014 2017 FALSE 194 0.0759
## 5 2014 2018 FALSE 148 0.0579
## 6 2014 2019 FALSE 168 0.0658
## 7 2015 2014 FALSE 298 0.115
## 8 2015 2015 TRUE 1338 0.516
## 9 2015 2016 FALSE 228 0.0880
## 10 2015 2017 FALSE 227 0.0876
## # ... with 26 more rows
# Evaluate variable importance
helperPlotVarImp(rf_klas_TDmcwh_50$rfModel)
# Compare error evolution through the first 50 trees
tibble::tibble(ntree=1:50,
withSLP=rf_klas_TDmcwha$rfModel$err.rate[1:50, "OOB"],
noSLP=rf_klas_TDmcwh_50$rfModel$err.rate[1:50, "OOB"]
) %>%
pivot_longer(-ntree, names_to="Model", values_to="Error") %>%
ggplot(aes(x=ntree, y=Error, color=Model)) +
geom_line() +
geom_text(data=~filter(., ntree %in% c(1, 50)),
aes(y=Error+0.02, label=paste0(round(100*Error, 1), "%"))
) +
labs(x="# Trees",
y="OOB Error Rate",
title="OOB Error Evolution for SLP Included/Excluded"
) +
ylim(c(0, NA))
Modified sea-level pressure is revealed as a meaningful driver of prediction accuracy for Las Vegas also, though with a lesser effect than seen in Chicago. This potentially suggests less variability by year in SLP for Las Vegas.
Since SLP is so meaningful, are there any patterns by year?
sub_kord_data %>%
bind_rows(sub_klas_data, .id="cityFile") %>%
select(cityFile, year, month, modSLP) %>%
mutate(city=case_when(cityFile==1 ~ "Chicago", cityFile==2 ~ "Las Vegas", TRUE ~ "Error")) %>%
filter(!is.na(modSLP)) %>%
ggplot(aes(x=factor(year), y=modSLP)) +
geom_boxplot(fill="lightblue") +
facet_wrap(~city) +
labs(x="", y="modSLP", title="Modified Sea-Level Pressure by Year (Chicago and Las Vegas)")
Chicago has a higher sea-level pressure than Las Vegas (as expected), though neither city shows much variation on average from year to year. The Chicago data are explored further by month:
sub_kord_data %>%
select(year, month, modSLP) %>%
filter(!is.na(modSLP)) %>%
ggplot(aes(x=month, y=modSLP)) +
geom_boxplot(fill="lightblue") +
facet_wrap(~year) +
labs(x="", y="modSLP", title="Modified Sea-Level Pressure by Month and Year (Chicago)")
sub_kord_data %>%
select(year, month, modSLP) %>%
filter(!is.na(modSLP)) %>%
group_by(year, month) %>%
summarize(medSLP=median(modSLP)) %>%
ggplot(aes(x=month, y=medSLP, color=factor(year), group=year)) +
geom_line(lwd=1) +
labs(x="", y="Median modSLP", title="Median modified Sea-Level Pressure by Month and Year (Chicago)")
sub_kord_data %>%
select(year, month, modSLP) %>%
filter(!is.na(modSLP)) %>%
group_by(year, month) %>%
summarize(medSLP=median(modSLP)) %>%
group_by(month) %>%
summarize(sdSLP=sd(medSLP)) %>%
ggplot(aes(x=month, y=sdSLP)) +
geom_point() +
labs(x="",
y="Standard Deviation of Median modSLP by Year",
title="Variability by year of modified Sea-Level Pressure by Month (Chicago)"
)
There are meaningful differences in median modified sea-level pressure by month by year. This is suggestive that several year of data are needed to define an archetype, as there is risk that otherwise the archetype will be an anomalous “high pressure was in contol in June” when that is not generally applicable. And, since high vs. low pressure are often related to temperature, dew point, coludiness, wind speed, and the like, the need for multiple years of data to define an archetype likely extends to all the other variables as well.
With greater variability in SLP by year in the cold-weather months, predictive ability in the cold-weather months is expected to be higher:
rf_kord_TDmcwha$testData %>%
group_by(month) %>%
summarize(pctError=1-mean(correct)) %>%
ggplot(aes(x=month, y=pctError)) +
geom_col(fill="lightblue") +
geom_text(aes(y=pctError/2, label=paste0(round(100*pctError, 1), "%"))) +
labs(x="", y="Error Rate", title="Prediction Error Rate by Month (Chicago)")
As expected, the model takes advantage of greater annual variation in modSLP during the cold season to make better prediction of which year it is during the cold season.
Next an attempt is made to predict the month given the data:
# Run random forest for Las Vegas 2014-2019 data
rf_klas_TDycwha_month_50 <- rfMultiLocale(sub_klas_data,
vrbls=c("TempF", "DewF",
"year", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP"
),
locs=NULL,
locVar="month",
pred="month",
ntree=50,
seed=2007021356,
mtry=4
)
##
## Running for locations:
## [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
Prediction accuracy and variable importance can then be assessed:
# Evaluate prediction accuracy
evalPredictions(rf_klas_TDycwha_month_50,
plotCaption = "Temp, Dew Point, Year, Hour of Day, Cloud Height, Wind, Altimeter",
keyVar="month",
locOrder=TRUE
)
## # A tibble: 105 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Jan Jan TRUE 1087 0.809
## 2 Jan Feb FALSE 50 0.0372
## 3 Jan Mar FALSE 27 0.0201
## 4 Jan Apr FALSE 6 0.00447
## 5 Jan May FALSE 5 0.00372
## 6 Jan Nov FALSE 50 0.0372
## 7 Jan Dec FALSE 118 0.0879
## 8 Feb Jan FALSE 75 0.0626
## 9 Feb Feb TRUE 820 0.684
## 10 Feb Mar FALSE 105 0.0876
## # ... with 95 more rows
# Evaluate variable importance
helperPlotVarImp(rf_klas_TDycwha_month_50$rfModel)
# Evaluate error evolution
errorEvolution(rf_klas_TDycwha_month_50, useCategory="OOB", subT="Las Vegas (2014-2019)")
## # A tibble: 50 x 3
## ntree Category Error
## <int> <chr> <dbl>
## 1 1 OOB 0.536
## 2 2 OOB 0.533
## 3 3 OOB 0.528
## 4 4 OOB 0.521
## 5 5 OOB 0.514
## 6 6 OOB 0.501
## 7 7 OOB 0.494
## 8 8 OOB 0.481
## 9 9 OOB 0.472
## 10 10 OOB 0.462
## # ... with 40 more rows
There is meaningful seasonality to the Las Vegas data such that the predicted month is frequently either the same season or (for spring/fall) the mirror season.
Suppose that year is not available as a predictor:
# Run random forest for Las Vegas 2014-2019 data
rf_klas_TDcwha_month_50 <- rfMultiLocale(sub_klas_data,
vrbls=c("TempF", "DewF",
"hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP"
),
locs=NULL,
locVar="month",
pred="month",
ntree=50,
seed=2007021402,
mtry=4
)
##
## Running for locations:
## [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
Prediction accuracy and variable importance can then be assessed:
# Evaluate prediction accuracy
evalPredictions(rf_klas_TDcwha_month_50,
plotCaption = "Temp, Dew Point, Hour of Day, Cloud Height, Wind, Altimeter",
keyVar="month",
locOrder=TRUE
)
## # A tibble: 105 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Jan Jan TRUE 920 0.697
## 2 Jan Feb FALSE 85 0.0644
## 3 Jan Mar FALSE 33 0.025
## 4 Jan Apr FALSE 6 0.00455
## 5 Jan May FALSE 5 0.00379
## 6 Jan Nov FALSE 70 0.0530
## 7 Jan Dec FALSE 201 0.152
## 8 Feb Jan FALSE 112 0.0945
## 9 Feb Feb TRUE 612 0.516
## 10 Feb Mar FALSE 139 0.117
## # ... with 95 more rows
# Evaluate variable importance
helperPlotVarImp(rf_klas_TDcwha_month_50$rfModel)
# Evaluate error evolution
errorEvolution(rf_klas_TDcwha_month_50, useCategory="OOB", subT="Las Vegas (2014-2019)")
## # A tibble: 50 x 3
## ntree Category Error
## <int> <chr> <dbl>
## 1 1 OOB 0.618
## 2 2 OOB 0.616
## 3 3 OOB 0.611
## 4 4 OOB 0.605
## 5 5 OOB 0.599
## 6 6 OOB 0.593
## 7 7 OOB 0.587
## 8 8 OOB 0.580
## 9 9 OOB 0.575
## 10 10 OOB 0.565
## # ... with 40 more rows
The same seasonal pattern is observed, though prediction accuracy falls be ~15%. This is suggestive that it is not uncommon for Las Vegas climate to run a month-ahead or a month-behind where it ran in a previous year.
The model appears to still be learning at 50 trees, so it is expanded to 150 trees:
# Run random forest for Las Vegas 2014-2019 data
rf_klas_TDcwha_month_150 <- rfMultiLocale(sub_klas_data,
vrbls=c("TempF", "DewF",
"hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP"
),
locs=NULL,
locVar="month",
pred="month",
ntree=150,
seed=2007021402,
mtry=4
)
##
## Running for locations:
## [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
Prediction accuracy and variable importance can then be assessed:
# Evaluate prediction accuracy
evalPredictions(rf_klas_TDcwha_month_150,
plotCaption = "Temp, Dew Point, Hour of Day, Cloud Height, Wind, Altimeter",
keyVar="month",
locOrder=TRUE
)
## # A tibble: 105 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Jan Jan TRUE 927 0.702
## 2 Jan Feb FALSE 75 0.0568
## 3 Jan Mar FALSE 38 0.0288
## 4 Jan Apr FALSE 6 0.00455
## 5 Jan May FALSE 5 0.00379
## 6 Jan Nov FALSE 69 0.0523
## 7 Jan Dec FALSE 200 0.152
## 8 Feb Jan FALSE 91 0.0768
## 9 Feb Feb TRUE 623 0.526
## 10 Feb Mar FALSE 131 0.111
## # ... with 95 more rows
# Evaluate variable importance
helperPlotVarImp(rf_klas_TDcwha_month_150$rfModel)
# Evaluate error evolution
errorEvolution(rf_klas_TDcwha_month_150, useCategory="OOB", subT="Las Vegas (2014-2019)")
## # A tibble: 150 x 3
## ntree Category Error
## <int> <chr> <dbl>
## 1 1 OOB 0.618
## 2 2 OOB 0.616
## 3 3 OOB 0.611
## 4 4 OOB 0.605
## 5 5 OOB 0.599
## 6 6 OOB 0.593
## 7 7 OOB 0.587
## 8 8 OOB 0.580
## 9 9 OOB 0.575
## 10 10 OOB 0.565
## # ... with 140 more rows
Another topic of interest is the use of sea-level pressure or altimeter in modeling. The altimeter setting in a METAR is captured in two ways:
In theory, both metrics report almost the same thing but in different units. There is a small difference in that SLP makes an adjustment for average 12-hour temperature (since temperature is a driver of air pressure) while altimeter is purely an adjustment for altitude.
How does the model perform with neither/either/both or Altimeter and modSLP included? An example will be used for trying to classify a handful of cities from 2016:
sub_mini_2016_data <- metarData %>%
mutate(hr=lubridate::hour(dtime)) %>%
filter(year==2016,
str_sub(source, 1 ,4) %in% c("kord", "kmke", "kmsy", "klas", "ksan", "kden", "kbos", "ksea")
)
First, a model is run with neither Altimeter nor SLP:
# Run random forest for subset data (2016)
rf_mini_TDmcwh <- rfMultiLocale(sub_mini_2016_data,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir"
),
locs=NULL,
locVar="locale",
pred="locale",
ntree=100,
seed=2007021422,
mtry=4
)
##
## Running for locations:
## [1] "Boston, MA (2016)" "Chicago, IL (2016)" "Denver, CO (2016)"
## [4] "Las Vegas, NV (2016)" "Milwaukee, WI (2016)" "New Orleans, LA (2016)"
## [7] "San Diego, CA (2016)" "Seattle, WA (2016)"
Prediction accuracy and variable importance can then be assessed:
# Evaluate prediction accuracy
evalPredictions(rf_mini_TDmcwh,
plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind",
keyVar="locale",
locOrder=NULL
)
## # A tibble: 64 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Boston, MA (2016) Boston, MA (2016) TRUE 1800 0.695
## 2 Boston, MA (2016) Chicago, IL (2016) FALSE 186 0.0718
## 3 Boston, MA (2016) Denver, CO (2016) FALSE 158 0.0610
## 4 Boston, MA (2016) Las Vegas, NV (2016) FALSE 25 0.00965
## 5 Boston, MA (2016) Milwaukee, WI (2016) FALSE 204 0.0788
## 6 Boston, MA (2016) New Orleans, LA (2016) FALSE 39 0.0151
## 7 Boston, MA (2016) San Diego, CA (2016) FALSE 41 0.0158
## 8 Boston, MA (2016) Seattle, WA (2016) FALSE 137 0.0529
## 9 Chicago, IL (2016) Boston, MA (2016) FALSE 167 0.0641
## 10 Chicago, IL (2016) Chicago, IL (2016) TRUE 1538 0.590
## # ... with 54 more rows
# Evaluate variable importance
helperPlotVarImp(rf_mini_TDmcwh$rfModel)
# Evaluate error evolution
errorEvolution(rf_mini_TDmcwh, useCategory="OOB", subT="No Altimeter, No SLP")
## # A tibble: 100 x 3
## ntree Category Error
## <int> <chr> <dbl>
## 1 1 OOB 0.374
## 2 2 OOB 0.373
## 3 3 OOB 0.369
## 4 4 OOB 0.363
## 5 5 OOB 0.355
## 6 6 OOB 0.345
## 7 7 OOB 0.335
## 8 8 OOB 0.328
## 9 9 OOB 0.317
## 10 10 OOB 0.309
## # ... with 90 more rows
Overall accuracy is roughly 75%, with the greatest challenge being separating Chicago and Milwaukee, with Boston also showing some similarities to these cities.
The process is then run with both Altimeter and SLP:
# Run random forest for subset data (2016)
rf_mini_TDmcwhas <- rfMultiLocale(sub_mini_2016_data,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP", "Altimeter"
),
locs=NULL,
locVar="locale",
pred="locale",
ntree=100,
seed=2007021429,
mtry=4
)
##
## Running for locations:
## [1] "Boston, MA (2016)" "Chicago, IL (2016)" "Denver, CO (2016)"
## [4] "Las Vegas, NV (2016)" "Milwaukee, WI (2016)" "New Orleans, LA (2016)"
## [7] "San Diego, CA (2016)" "Seattle, WA (2016)"
Prediction accuracy and variable importance can again be assessed:
# Evaluate prediction accuracy
evalPredictions(rf_mini_TDmcwhas,
plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind, Altimeter, and SLP",
keyVar="locale",
locOrder=NULL
)
## Warning: Removed 1 rows containing missing values (geom_text).
## # A tibble: 62 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Boston, MA (2016) Boston, MA (2016) TRUE 2131 0.837
## 2 Boston, MA (2016) Chicago, IL (2016) FALSE 90 0.0354
## 3 Boston, MA (2016) Denver, CO (2016) FALSE 44 0.0173
## 4 Boston, MA (2016) Las Vegas, NV (2016) FALSE 25 0.00982
## 5 Boston, MA (2016) Milwaukee, WI (2016) FALSE 109 0.0428
## 6 Boston, MA (2016) New Orleans, LA (2016) FALSE 18 0.00707
## 7 Boston, MA (2016) San Diego, CA (2016) FALSE 30 0.0118
## 8 Boston, MA (2016) Seattle, WA (2016) FALSE 98 0.0385
## 9 Chicago, IL (2016) Boston, MA (2016) FALSE 112 0.0412
## 10 Chicago, IL (2016) Chicago, IL (2016) TRUE 1852 0.681
## # ... with 52 more rows
# Evaluate variable importance
helperPlotVarImp(rf_mini_TDmcwhas$rfModel)
# Evaluate error evolution
errorEvolution(rf_mini_TDmcwhas, useCategory="OOB", subT="Both Altimeter and SLP")
## # A tibble: 100 x 3
## ntree Category Error
## <int> <chr> <dbl>
## 1 1 OOB 0.316
## 2 2 OOB 0.319
## 3 3 OOB 0.319
## 4 4 OOB 0.310
## 5 5 OOB 0.299
## 6 6 OOB 0.291
## 7 7 OOB 0.282
## 8 8 OOB 0.271
## 9 9 OOB 0.261
## 10 10 OOB 0.251
## # ... with 90 more rows
Overall accuracy increases to about 85%. Except for Chicago/Milwaukee, every city is overwhelmingly classified as itself.
Smaller runs with just SLP and just Altimeter are conducted, to verify that the learning rate is roughly the same in these cases. As well SLP plus a dummy variable (SLP + rnorm()) is run:
# Run random forest for subset data (2016)
rf_mini_TDmcwhs <- rfMultiLocale(sub_mini_2016_data,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP"
),
locs=NULL,
locVar="locale",
pred="locale",
ntree=50,
seed=2007021435,
mtry=4
)
##
## Running for locations:
## [1] "Boston, MA (2016)" "Chicago, IL (2016)" "Denver, CO (2016)"
## [4] "Las Vegas, NV (2016)" "Milwaukee, WI (2016)" "New Orleans, LA (2016)"
## [7] "San Diego, CA (2016)" "Seattle, WA (2016)"
# Run random forest for subset data (2016)
rf_mini_TDmcwha <- rfMultiLocale(sub_mini_2016_data,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"Altimeter"
),
locs=NULL,
locVar="locale",
pred="locale",
ntree=50,
seed=2007021440,
mtry=4
)
##
## Running for locations:
## [1] "Boston, MA (2016)" "Chicago, IL (2016)" "Denver, CO (2016)"
## [4] "Las Vegas, NV (2016)" "Milwaukee, WI (2016)" "New Orleans, LA (2016)"
## [7] "San Diego, CA (2016)" "Seattle, WA (2016)"
set.seed(2007021444)
subUseData <- sub_mini_2016_data %>%
mutate(dummy=modSLP + rnorm(n=nrow(sub_mini_2016_data)))
# Run random forest for subset data (2016)
rf_mini_TDmcwhsd <- rfMultiLocale(subUseData,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP", "dummy"
),
locs=NULL,
locVar="locale",
pred="locale",
ntree=50,
seed=2007021445,
mtry=4
)
##
## Running for locations:
## [1] "Boston, MA (2016)" "Chicago, IL (2016)" "Denver, CO (2016)"
## [4] "Las Vegas, NV (2016)" "Milwaukee, WI (2016)" "New Orleans, LA (2016)"
## [7] "San Diego, CA (2016)" "Seattle, WA (2016)"
OOB error evolution through the first 50 trees can be compared:
# Compare error evolution through the first 50 trees
tibble::tibble(ntree=1:50,
neither=rf_mini_TDmcwh$rfModel$err.rate[1:50, "OOB"],
altimeter=rf_mini_TDmcwha$rfModel$err.rate[1:50, "OOB"],
modslp=rf_mini_TDmcwhs$rfModel$err.rate[1:50, "OOB"],
slp_dummy=rf_mini_TDmcwhsd$rfModel$err.rate[1:50, "OOB"],
both=rf_mini_TDmcwhas$rfModel$err.rate[1:50, "OOB"]
) %>%
pivot_longer(-ntree, names_to="Model", values_to="Error") %>%
ggplot(aes(x=ntree, y=Error, color=Model)) +
geom_line() +
geom_text(data=~filter(., ntree %in% c(1, 50)),
aes(y=Error+0.02, label=paste0(round(100*Error, 1), "%"))
) +
labs(x="# Trees",
y="OOB Error Rate",
title="OOB Error Evolution for SLP Included/Excluded"
) +
ylim(c(0, NA))
Interestingly, the inclusion of both Altimeter and modSLP appears to drive a 2%-3% decrease in classification error. The main difference between the variables is that modSLP contains a hint about the previous 12-hour temperature.
Importantly, including a dummy variable with modSLP has either no impact or a slight negative impact on model performance. So, the addition of Altimeter to modSLP is likely a real impact, not merely the inclusion of two highly correlated variables that are both valuable to driving model performance.